Abstract:
We present Siren, an interactive tool for mining and visualizing geospatial redescriptions. Redescription mining is a powerful data analysis tool that aims at finding alternative descriptions of the same entities. For example, in biology, an important task is to identify the bioclimatic constraints that allow some species to survive, that is, to describe geographical regions in terms of both the fauna that inhabits them and their bioclimatic conditions. Using Siren, users can explore geospatial data of interest by visualizing redescriptions on a map and by interactively editing, extending, and filtering them. To demonstrate the use of the tool, we focus on climatic niche-finding over Europe as an example task. Yet Siren is by no means limited to a particular dataset or application.
Abstract:
In many areas of science, scientists need to find distinct common characterizations of the same objects and, vice versa, to identify sets of objects that admit multiple shared descriptions. For example, in biology, an important task is to identify the bioclimatic constraints that allow some species to survive, that is, to describe geographical regions both in terms of the fauna that inhabits them and of their bioclimatic conditions. In data analysis, the task of automatically generating such alternative characterizations is called redescription mining.
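As a rough illustration of what "alternative characterizations of the same objects" means, the sketch below evaluates two queries over the same set of regions and measures how closely their supports agree. The abstract does not name a quality measure; the Jaccard coefficient used here is an assumption, and the toy data, the field names (`moose`, `max_temp_july`), and the 18.0 threshold are all hypothetical.

```python
def support(rows, predicate):
    """Return the set of row indices for which the predicate holds."""
    return {i for i, row in enumerate(rows) if predicate(row)}


def jaccard(a, b):
    """Jaccard coefficient of two sets (assumed here as the agreement measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy data: each row describes one geographic region.
regions = [
    {"moose": True, "max_temp_july": 14.0},
    {"moose": True, "max_temp_july": 16.5},
    {"moose": False, "max_temp_july": 24.0},
    {"moose": False, "max_temp_july": 17.0},
    {"moose": True, "max_temp_july": 21.0},
]

# One description in terms of fauna, one in terms of bioclimatic conditions.
fauna_side = support(regions, lambda r: r["moose"])
climate_side = support(regions, lambda r: r["max_temp_july"] < 18.0)

# The closer the two supports, the better the pair works as a redescription.
accuracy = jaccard(fauna_side, climate_side)
```

With this data the two queries agree on two regions out of the four that either one covers, so the pair is only a weak redescription; a mining algorithm would search for query pairs that score higher.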
Abstract:
Frequent itemset mining is an important problem in data mining with a wide range of applications. Many decision support systems need to support online interactive frequent itemset mining, which is challenging because frequent itemset mining is a computation-intensive, repetitive process. One solution is to precompute frequent itemsets. In this paper, we propose a compact disk-based data structure, the CFP-tree, to store precomputed frequent itemsets on disk and serve online mining requests. The CFP-tree structure exploits the redundancy in frequent itemsets to save space. The compression ratio of a CFP-tree can be as high as several thousand or even higher. We develop efficient algorithms for retrieving frequent itemsets from a CFP-tree, as well as efficient algorithms for constructing and maintaining one. Our performance study demonstrates that with a CFP-tree, frequent itemset mining requests can be answered promptly.
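The precomputation idea can be sketched as follows. This is not the CFP-tree: it stores the precomputed itemsets in a flat in-memory dictionary with no compression and no disk layout, and it counts itemsets by brute force, which is only practical for tiny inputs. The function names and the toy transactions are hypothetical.

```python
from itertools import combinations


def precompute_frequent_itemsets(transactions, min_count):
    """Count every itemset by brute force; keep those in >= min_count transactions."""
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                counts[combo] = counts.get(combo, 0) + 1
    return {s: c for s, c in counts.items() if c >= min_count}


def answer_request(store, min_count):
    """Serve an online request without touching the transactions again.

    Only valid when min_count is at least the precomputation threshold.
    """
    return {s: c for s, c in store.items() if c >= min_count}


transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c"],
    ["a", "b", "c"],
]

# Precompute once at a low threshold, then answer stricter requests from the store.
store = precompute_frequent_itemsets(transactions, min_count=2)
result = answer_request(store, min_count=4)
```

The store grows quickly as the precomputation threshold drops, which is the space problem a compressed structure is meant to address.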
Abstract:
The problems of recurrent and anomalous pattern discovery in time series, e.g., motifs and discords, respectively, have received a lot of attention from researchers in the past decade. However, since the pattern search space is usually intractable, most existing detection algorithms require that the patterns have discriminative characteristics and that their length be known in advance and provided as input, which is an unreasonable requirement for many real-world problems. In addition, patterns of similar structure but of different lengths may co-exist in a time series. Addressing these issues, we have developed algorithms for variable-length time series pattern discovery that are based on symbolic discretization and grammar inference, two techniques whose combination enables the structured reduction of the search space and the discovery of candidate patterns in linear time. In this work, we present GrammarViz 3.0, a software package that provides implementations of the proposed algorithms and a graphical user interface for interactive variable-length time series pattern discovery. The current version of the software provides an alternative grammar inference algorithm that improves the time series motif discovery workflow, and introduces an experimental procedure for automated discretization parameter selection that builds upon the minimum cardinality maximum cover principle and aids the discovery of recurrent and anomalous time series patterns.
Abstract:
Protein-protein interactions are observed in various biological processes. They are important for understanding the underlying molecular mechanisms and can be potential targets for developing small-molecule regulators of such processes. Previous studies suggest that certain residues on protein-protein binding interfaces are "hot spots". As an extension to this concept, we have developed a residue-based method to identify the characteristic interaction patterns (CIPs) on protein-protein binding interfaces, in which each pattern is a cluster of four contacting residues. Systematic analysis was conducted on a nonredundant set of 1,222 protein-protein binding interfaces selected out of the entire Protein Data Bank. Favored interaction patterns across different protein-protein binding interfaces were retrieved by considering both geometrical and chemical conservation. As demonstrated on two test sets, our method was able to predict hot spot residues on protein-protein binding interfaces with good recall scores and acceptable precision scores. By analyzing the function annotations and the evolutionary tree of the protein-protein complexes in our data set, we also observed that protein-protein interfaces sharing common characteristic interaction patterns are normally associated with identical or similar biological functions.
Abstract:
Over the past decade, an increasing number of efficient algorithms have been proposed to mine frequent patterns that satisfy a minimum support threshold. Generally, determining an appropriate value for the minimum support threshold is extremely difficult, because the appropriate value depends on the type of application and the expectations of the user. Moreover, in some real-time applications such as web mining and e-business, new correlations between patterns need to be found by changing the minimum support threshold. Since rerunning mining algorithms from scratch is very costly and time-consuming, researchers have introduced interactive mining of frequent patterns. Recently, a few efficient interactive mining algorithms have been proposed that capture the content of the transaction database so that the database does not need to be rescanned. In this paper, we propose a new method based on prime numbers and their properties, mainly for interactive mining of frequent patterns. Our method isolates the mining model from the mining process, so that once the mining model is constructed, it can be reused by the mining process with various minimum support thresholds. During the mining process, the mining algorithm reduces the number of candidate patterns and comparisons by using a new candidate set, called the candidate head set, and several efficient pruning techniques. The experimental results verify the efficiency of our method for interactive mining of frequent patterns.
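The abstract does not spell out how prime numbers are used. One common way to exploit them, assumed here purely for illustration, is to give each item a distinct prime and encode each transaction as the product of its items' primes; an itemset is then contained in a transaction exactly when the itemset's product divides the transaction's value. The sketch below builds such a model once and queries it repeatedly. It does not implement the candidate head set or the pruning techniques, and all function names are hypothetical.

```python
from math import prod


def first_primes(n):
    """Return the first n primes by trial division against the primes found so far."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def build_model(transactions):
    """Scan the database once: map items to primes, transactions to prime products."""
    items = sorted({i for t in transactions for i in t})
    prime_of = dict(zip(items, first_primes(len(items))))
    encoded = [prod(prime_of[i] for i in set(t)) for t in transactions]
    return prime_of, encoded


def support(prime_of, encoded, itemset):
    """Count transactions containing the itemset, using divisibility only."""
    key = prod(prime_of[i] for i in set(itemset))
    return sum(1 for value in encoded if value % key == 0)


transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c"],
    ["a", "b", "c"],
]

prime_of, encoded = build_model(transactions)

# The same model answers queries under any threshold without a rescan.
ab_support = support(prime_of, encoded, ["a", "b"])
abc_is_frequent_at_2 = support(prime_of, encoded, ["a", "b", "c"]) >= 2
abc_is_frequent_at_3 = support(prime_of, encoded, ["a", "b", "c"]) >= 3
```

Because the products grow with the number of items per transaction, a practical implementation would need to bound or factor these values; Python's arbitrary-precision integers hide that cost in this sketch.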
Abstract:
Sequential pattern mining finds applications in numerous diverse fields. Due to the problem's combinatorial nature, two main challenges arise. First, existing algorithms output large numbers of patterns, many of which are uninteresting from a user's perspective. Second, as datasets grow, mining large numbers of patterns becomes computationally expensive. There is thus a need for mining approaches that make it possible to focus the pattern search towards directions of interest. This work tackles this problem by combining interactive visualization with sequential pattern mining in order to create a "transparent box" execution model. We propose a novel approach to interactive visual sequence mining that allows the user to guide the execution of a pattern-growth algorithm at suitable points through a powerful visual interface. Our approach (1) introduces the possibility of using local constraints during the mining process, (2) allows stepwise visualization of patterns being mined, and (3) enables the user to steer the mining algorithm towards directions of interest. The use of local constraints significantly improves users' capability to progressively refine the search space without the need to restart computations. We exemplify our approach using two event sequence datasets: one composed of web page visits and another composed of individuals' activity sequences.
Abstract:
The classical applications of Association Rule Mining (ARM) are market analysis, network traffic analysis, and web log analysis, where strategic decisions are made by analyzing the frequent itemsets from a large pool of data. Datasets in such domains are constantly updated, so they require an efficient Frequent Pattern Mining (FPM) algorithm that is capable of extracting the required information. Several incremental algorithms have been proposed to generate frequent patterns, but they are ineffective with very large datasets and do not let the user adjust the minimum support value interactively. This paper first presents an efficient interactive sequential FPM algorithm that uses the knowledge gained in the previous mining steps to incrementally mine the updated database with lower complexity. Then, to further reduce the time complexity, it proposes an efficient interactive and incremental parallel mining algorithm. This algorithm also produces incremental frequent patterns without generating local frequent itemsets, and with less communication and synchronization overhead.
Abstract:
Recently, high utility pattern (HUP) mining has become one of the most important research issues in data mining due to its ability to consider the nonbinary frequency values of items in transactions and different profit values for every item. On the other hand, incremental and interactive data mining provide the ability to use previous data structures and mining results in order to reduce unnecessary calculations when a database is updated or when the minimum threshold is changed. In this paper, we propose three novel tree structures to efficiently perform incremental and interactive HUP mining. The first tree structure, the Incremental HUP Lexicographic Tree ($\mathrm{IHUP}_{\mathrm{L}}$-Tree), is arranged according to the items' lexicographic order. It can capture the incremental data without any restructuring operation. The second tree structure is the IHUP Transaction Frequency Tree ($\mathrm{IHUP}_{\mathrm{TF}}$-Tree), which obtains a compact size by arranging items according to their transaction frequency (in descending order). To reduce the mining time, the third tree, the IHUP Transaction-Weighted Utilization Tree ($\mathrm{IHUP}_{\mathrm{TWU}}$-Tree), is designed based on the TWU value of items in descending order. Extensive performance analyses show that our tree structures are very efficient and scalable for incremental and interactive HUP mining.
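The three item orderings can be illustrated on a toy database. The sketch below assumes the usual definitions: a transaction's utility is the sum of quantity times profit over its items, and an item's TWU is the sum of the utilities of the transactions that contain it. It computes only the orderings, not the trees themselves; the profit table, the transactions, and the lexicographic tie-break are hypothetical.

```python
def transaction_utility(transaction, profit):
    """Sum of quantity * unit profit over the items of one transaction."""
    return sum(qty * profit[item] for item, qty in transaction.items())


def item_statistics(transactions, profit):
    """Return per-item transaction frequency (TF) and transaction-weighted utilization (TWU)."""
    tf, twu = {}, {}
    for t in transactions:
        tu = transaction_utility(t, profit)
        for item in t:
            tf[item] = tf.get(item, 0) + 1
            twu[item] = twu.get(item, 0) + tu
    return tf, twu


# Hypothetical unit profits and transactions (item -> purchased quantity).
profit = {"a": 2, "b": 5, "c": 1}
transactions = [
    {"a": 1, "b": 2},
    {"b": 1, "c": 3},
    {"a": 2, "c": 1},
    {"c": 5},
]

tf, twu = item_statistics(transactions, profit)

# Item orderings corresponding to the three tree variants; ties broken by name.
lexicographic_order = sorted(tf)
tf_order = sorted(tf, key=lambda i: (-tf[i], i))
twu_order = sorted(twu, key=lambda i: (-twu[i], i))
```

The lexicographic order never changes when transactions are added, which is why a tree built on it needs no restructuring, whereas the TF and TWU orders can shift with every update.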
Abstract:
The FP-growth algorithm using the FP-tree has been widely studied for frequent pattern mining because it can dramatically improve performance compared to the candidate generation-and-test paradigm of Apriori. However, it still requires two database scans, which is not consistent with efficient data stream processing. In this paper, we present a novel tree structure, called the CP-tree (compact pattern tree), that captures database information with one scan (insertion phase) and provides the same mining performance as the FP-growth method (restructuring phase). The CP-tree introduces the concept of dynamic tree restructuring to produce a highly compact frequency-descending tree structure at runtime. We also propose an efficient tree restructuring method, called the branch sorting method, that restructures a prefix tree branch by branch. Moreover, the CP-tree provides full functionality for interactive and incremental mining. Extensive experimental results show that the CP-tree is efficient for frequent pattern mining and for interactive and incremental mining with a single database scan.